Skip to content

Conversation

@harrisonGPU
Copy link
Contributor

@harrisonGPU harrisonGPU commented Nov 3, 2025

This patch adds support for the pattern:

  %index = select i1 %idx_sel, i32 0, i32 4
  %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index

by scaling the byte offset to an element index (index >> log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of using
scratch memory.

@llvmbot
Copy link
Member

llvmbot commented Nov 3, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

This patch adds support for the pattern:

  %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index

by scaling the byte offset to an element index (index >> log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of using
scratch memory.


Full diff: https://github.com/llvm/llvm-project/pull/166132.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+13-2)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll (+20)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index ddabd25894414..793c0237cdf38 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -456,10 +456,21 @@ static Value *GEPToVectorIndex(GetElementPtrInst *GEP, AllocaInst *Alloca,
   const auto &VarOffset = VarOffsets.front();
   APInt OffsetQuot;
   APInt::sdivrem(VarOffset.second, VecElemSize, OffsetQuot, Rem);
-  if (Rem != 0 || OffsetQuot.isZero())
-    return nullptr;
+
+  Value *Scaled = nullptr;
+  if (Rem != 0 || OffsetQuot.isZero()) {
+    unsigned ElemSizeShift = Log2_64(VecElemSize);
+    Scaled = Builder.CreateLShr(VarOffset.first, ElemSizeShift);
+    if (Instruction *NewInst = dyn_cast<Instruction>(Scaled))
+      NewInsts.push_back(NewInst);
+    OffsetQuot = APInt(BW, 1);
+    Rem = 0;
+  }
 
   Value *Offset = VarOffset.first;
+  if (Scaled)
+    Offset = Scaled;
+
   auto *OffsetType = dyn_cast<IntegerType>(Offset->getType());
   if (!OffsetType)
     return nullptr;
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..65bddaba8dd14 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -250,6 +250,26 @@ bb2:
   store i32 0, ptr addrspace(5) %extractelement
   ret void
 }
+
+define amdgpu_kernel void @scalar_alloca_vector_gep_i8(ptr %buffer, float %data, i32 %index) {
+; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_vector_gep_i8(
+; CHECK-SAME: ptr [[BUFFER:%.*]], float [[DATA:%.*]], i32 [[INDEX:%.*]]) {
+; CHECK-NEXT:    [[ALLOCA:%.*]] = freeze <3 x float> poison
+; CHECK-NEXT:    [[VEC:%.*]] = load <3 x float>, ptr [[BUFFER]], align 16
+; CHECK-NEXT:    [[TMP1:%.*]] = lshr i32 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <3 x float> [[VEC]], float [[DATA]], i32 [[TMP1]]
+; CHECK-NEXT:    store <3 x float> [[TMP2]], ptr [[BUFFER]], align 16
+; CHECK-NEXT:    ret void
+;
+  %alloca = alloca <3 x float>, align 16, addrspace(5)
+  %vec = load <3 x float>, ptr %buffer
+  store <3 x float> %vec, ptr addrspace(5) %alloca
+  %elt = getelementptr inbounds nuw i8, ptr addrspace(5) %alloca, i32 %index
+  store float %data, ptr addrspace(5) %elt, align 4
+  %updated = load <3 x float>, ptr addrspace(5) %alloca, align 16
+  store <3 x float> %updated, ptr %buffer, align 16
+  ret void
+}
 ;.
 ; CHECK: [[META0]] = !{}
 ; CHECK: [[RNG1]] = !{i32 0, i32 1025}

Copy link
Contributor

@shiltian shiltian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the index here is somewhere in the middle? It doesn't look like we have any constraint on the index.


Value *Scaled = nullptr;
if (Rem != 0 || OffsetQuot.isZero()) {
unsigned ElemSizeShift = Log2_64(VecElemSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to validate VecElemSize is a power of 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but I think it is not necessary to explicitly check whether the element size is a power of two, because it is already covered by the existing check here:

Type *VecEltTy = VectorTy->getElementType();
unsigned ElementSizeInBits = DL->getTypeSizeInBits(VecEltTy);
if (ElementSizeInBits != DL->getTypeAllocSizeInBits(VecEltTy)) {
LLVM_DEBUG(dbgs() << " Cannot convert to vector if the allocation size "
"does not match the type's size\n");
return false;
}

If the element type is not naturally aligned, it will return false, which also rejects non power of 2 element sizes, such as i24.

%updated = load <3 x float>, ptr addrspace(5) %alloca, align 16
store <3 x float> %updated, ptr %buffer, align 16
ret void
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test with non-power-of-2 element size

@harrisonGPU
Copy link
Contributor Author

What if the index here is somewhere in the middle? It doesn't look like we have any constraint on the index.

Could you please give me an example?

@shiltian
Copy link
Contributor

shiltian commented Nov 4, 2025

Could you please give me an example?

In the test case, the %index is for %alloca of <3 x float>, which is 12-byte, each of which has 4-byte. Since the GEP is of type i8, what if %index is 5, which will be somewhere in the middle of the 2nd element of %alloca.

@harrisonGPU harrisonGPU force-pushed the amdgpu/promote-vector branch from c1439a3 to 6a28740 Compare November 10, 2025 08:22
@harrisonGPU
Copy link
Contributor Author

harrisonGPU commented Nov 10, 2025

Could you please give me an example?

In the test case, the %index is for %alloca of <3 x float>, which is 12-byte, each of which has 4-byte. Since the GEP is of type i8, what if %index is 5, which will be somewhere in the middle of the 2nd element of %alloca.

Hi @shiltian , I’ve already thought about this issue, thank you very much for your suggestion and for pointing it out.
Now I think we should only promote when the variable index is guaranteed to be aligned to the element size.
We can use computeKnownBits and countMinTrailingZeros to check that the lower bits of the index are zero, which verifies its alignment before promoting.
I’ve updated the lit test and commit message accordingly. What do you think?

return nullptr;

Value *Offset = VarOffset.first;
if (Rem != 0 || OffsetQuot.isZero()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it have to be this complicated? I thought checking whether offset % size would be sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I agree your points, I have removed it.

@github-actions
Copy link

🐧 Linux x64 Test Results

  • 186411 tests passed
  • 4868 tests skipped

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants